Who am I

My name is Tomasz Plata-Przechlewski and I live in Poland. I was born on 16th June 1963 (it was a Sunday, the exact day when Valentina Vladimirovna Tereshkova was launched into space, if you know who she is).

BTW in Poland born-on-Sunday means a work-shy (ie. lazy) person (so now you know your first Polish(?) proverb)

BTW by pure statistics \(1/7 \approx 14\)% of the population is work-shy:-)

I graduated in economics a long time ago and have taught (mainly) statistics and information systems. I am a big fan of open source software (OSS) and I know a few OSS systems, including Linux and LaTeX. And of course R, which I am about to show you in a while.

My hobbies are road cycling and history. I am also an amateur photographer (cf tprzechlewski@flickr).

Agenda

Statistics (nothing spectacular, just classical EDA, no (heavy) math, relax)

Statistical software (modern, non-standard or hipster #youcall)

Poland (via statistical examples)

How statistics is taught to students at social-science departments (at least in Poland)

Three components of Statistics:

Theory (models) + Tools (programs) + Practice (real data)

Undergraduate courses in social sciences in Poland concentrate on theory, use a spreadsheet as a universal computing tool and an Office-like editor (MS Word/OO Writer) as a universal publishing tool. Students work with artificial (clean) and small data sets and thus are unaware of the problems related to applying theory to practice. The workflow is shown in the following diagram:

It is claimed that the above scenario is the best one. More advanced tools would be too difficult (and time-consuming) for students to get acquainted with, thus distracting them from the main subject of the course, ie statistical methods.

How statistics should be taught to students (in my opinion at least)

Office software has limits. Spreadsheets are good for number crunching, but are not so good at: data cleaning (Practice), advanced graphics, spatial analysis (Practice/Theory), team work (Practice).

Office editors and PowerPoint are great tools, but they are not designed for quality publishing of statistical results.

It is wrong to ignore the existence of modern open source tools and not introduce students to them. It is wrong not to introduce students to some (even elementary) programming, and to stick exclusively to the point-and-click mode of work (ie spreadsheets).

I will try to demonstrate that using modern tools for statistical analysis is a feasible way to go, and that (some) modern tools are not much more difficult (let alone prohibitively so) than office software used above the basic level.

Conclusion: less theory, more practice and common sense. Show students the real ‘value chain’ of statistical analysis with all its problems (not covered nowadays):

Learning curve comparison

Learning curve of programmable vs point-and-click (direct-manipulation) tools:

For simple tasks learning a programmable tool is a waste of time, but as task complexity increases it becomes more beneficial to use a programmable tool (cf Scientific computing: Code alert)

Poor definitions example #1: Full Time Equivalence (FTE)

Number of students.

Who is a student?

A student is a person attending a 3rd-level school in the 3-stage education system (cf Educational_stage). The answer is still non-obvious as there are many forms of tertiary education. For example:

The UNESCO stated that tertiary education focuses on learning endeavors in specialized fields. It includes academic and higher vocational education.

So according to the above definition a school does not belong to tertiary education if its status is not academic and/or higher vocational. Examples: a Dance Academy or a University for Elderly People (aka University of the 3rd Age). Both are popular in Poland.

In many countries there is some certification scheme. For example in Poland a school must apply for (and get) a certificate to be regarded as a higher education institution (ie part of the tertiary level of education)

Heads vs Majors

A student can be enrolled in more than one course (major). So for counting heads it is necessary to remove duplicates, otherwise one would count majors not persons.

Part time studies

FTE stands for Full-Time-Equivalent, an approximation of the number of students who would be enrolled full-time

FTE is based on student credit hours: it is obtained by dividing the total student credit hours by the number of credit hours that constitutes full-time study.
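
The three ways of counting can be compared on a toy example. This is only a sketch: the enrolment records are invented and the full-time load of 30 credit hours is an assumption, not a value from any real regulation.

```r
# Hypothetical enrolment records: one row per (student, major) pair.
enrol <- data.frame(
  student_id = c(1, 1, 2, 3, 4, 4),
  major      = c("economics", "law", "economics", "history", "law", "sociology"),
  credits    = c(30, 15, 30, 15, 20, 25)   # credit hours taken in that major
)
full_time_load <- 30   # assumed number of credit hours of one full-time student

majors <- nrow(enrol)                          # counts enrolments (majors)
heads  <- length(unique(enrol$student_id))     # counts persons (duplicates removed)
fte    <- sum(enrol$credits) / full_time_load  # counts full-time equivalents

c(majors = majors, heads = heads, fte = fte)
```

The same group of people gives three different "numbers of students", which is exactly why the definition has to be stated together with the figure.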

Conclusion: Majors, Persons or FTEs? Which is the best?

University of Utah/Office of Analysis, Assessment and Accreditation google:single multiple majors fte

Poor definitions example #2: measurement of tourism activity [concept of an Indicator]

Who is a tourist? According to Glossary:Tourism:

Tourism means the activity of visitors taking a trip to a main destination outside their usual environment, for less than a year, for any main purpose, including business, leisure or other personal purpose, other than to be employed by a resident entity in the place visited.

According to the above definition, to be regarded as a tourist one has to change her/his place of accommodation for less than one year (otherwise Eurostat would regard her/him as a migrant)

The usual meaning (at least in Poland) is that a tourist is travelling for leisure, not for work. People travelling for work have other needs/aims than those travelling to rest (they usually do not use hotels, for example), so the above definition solves some problems but at the same time creates many others.

Number of tourists: does not distinguish between various types of tourists, difficult to collect (who is a tourist anyway?)

Various ‘number of’ measures of tourist-oriented establishments (hotels, catering units, beds, nights spent etc.). They do not measure tourists per se but are highly related and more reliable (as easier to count).

Indicator of tourist activity (by various tourist types).

Conclusion: measurement of tourism activity is not trivial. Other similar concepts: internet user, migrant, unemployed person, illiterate person

Poor definitions example #2: measurement of tourism activity [data collection]

Tourism supply statistics (accommodation statistics): Data on rented accommodation ie. capacity and occupancy of tourist accommodation establishments in the reporting country. How collected? Registers?

Quirks of data collection: Data up to and including 2015 refer only to those units that submitted statistical reports. Starting with the data for January 2016, imputation was implemented (ie replacing missing data with some (possibly meaningful :-)) values). (cf BDL)
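
To make the idea of imputation concrete, here is a minimal sketch. The source does not say which imputation method is actually used by the statistical office; replacing missing values with the median of the reported ones is just the simplest possible illustration, and the numbers are invented.

```r
# Hypothetical counts of nights spent; NA marks units that did not report.
nights <- c(120, 95, NA, 130, NA, 110)

# Median imputation (an assumption for illustration, not the official method):
# every missing value is replaced by the median of the reported values.
imputed <- ifelse(is.na(nights), median(nights, na.rm = TRUE), nights)

sum(nights, na.rm = TRUE)  # total over reporting units only (pre-2016 style)
sum(imputed)               # total after imputation (post-2016 style)
```

The two totals differ, so a time series that switches from one approach to the other contains a break that has nothing to do with tourism itself.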

Tourism demand statistics: Data on participation in tourism of the residents of the reporting country. How collected? Surveys?

Most of the time, data on domestic and outbound trips (where “outbound tourism” means residents of a country travelling in another country) is collected via sample surveys (cf Annual data on trips of EU residents and Tourism_statistics_-_top_destinations)

Regulations concerning data collection in tourism (hundreds of pages): Glossary:Supply_side_tourism_statistics and EU regulation No 692/2011

So now we know what we are dealing with…

Poor definitions example #2: Example nights spent (demand side)

Share of nights spent at EU-28 tourist accommodation by tourists travelling outside their own country of residence, 2017

Country of residence -> Foreign country (estimated data)

year 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
##  [1] 10173237  9609447 10064628 10620264 11876599 12471268 12992241
##  [8] 13757657 15579225 16705215

to be continued…

Poor definitions (final) example #3: dreadful aggregates

Indicators can be divided into hard indicators and soft indicators. Hard indicators denote hard facts while soft indicators are beliefs and intentions. For example the number of hotels is a fact, while the intention to stay abroad for less than a year is not a fact but an intention. In Poland at least 80% of respondents declare they intend to vote, while the true turnout never exceeds 55%. In other words measuring something using soft indicators is prone to (significant) errors.

That does not mean that a hard indicator is error-free. By definition it measures not the phenomenon itself but some proxy associated with the phenomenon.

With a hard indicator we have a precise measurement of an imprecise measure. With a soft indicator we have an imprecise measurement of an imprecise measure.

To cure (or hide) the problems, aggregates of indicators are constructed, either as sums (indexes or formative) or as averages (factors or reflective). Indexes are more popular in economics than averages/factors. For example Gross Domestic Product (GDP) is an index, while (customer) satisfaction defined as some set of opinions on a product would be a factor.

Control question: what is measured with GDP?

Interlude: R and Rstudio

R is both a programming language for statistical computing and graphics and a piece of software (ie an application) to execute programs written in R. R was developed in the mid 90s at the University of Auckland (New Zealand).

Since then R has become one of the dominant software environments for data analysis and is used by a variety of scientific disciplines.

BTW why does it have such a strange name (R)? A long time ago it was popular to use short names for computer languages (C for example). At AT&T Bell Labs (John Chambers) in the mid 70s a language oriented towards statistical computing was developed and called S (from Statistics). R is one letter before S in the alphabet.

Rstudio is an environment through which to use R. In Rstudio one can simultaneously write code, execute it, manage data, get help and view plots. Rstudio is a commercial product distributed under a dual-license system by RStudio, Inc. A key developer at RStudio is Hadley Wickham, another brilliant New Zealander (cf Hadley Wickham)

Microsoft has invested heavily in R development recently. It bought Revolution Analytics, a key developer of R and provider of commercial versions of the system. With MS support the system is expected to gain more popularity (for example by integrating it with popular MS products)

Interlude: Measures of central tendency, dispersion and skewness with R

(univariate analysis)

The CSV file hotele_caloroczne_PL.csv contains data on the number of all-season hotels in every county in Poland. First one has to load the dataset with the read.csv command:

d <- read.csv("hotele_caloroczne_PL.csv", sep = ";", header = TRUE, na.strings = "NA")

Computing measures of central tendency (with summary and/or fivenum)

summary(d)
##      teryt              powiat      hotele2012        hotele2017    
##  Min.   : 201   bielski    :  2   Min.   :  0.000   Min.   :  0.00  
##  1st Qu.:1005   brzeski    :  2   1st Qu.:  3.000   1st Qu.:  4.00  
##  Median :1636   grodziski  :  2   Median :  5.000   Median :  7.00  
##  Mean   :1721   krośnieński:  2   Mean   :  8.776   Mean   : 10.31  
##  3rd Qu.:2475   nowodworski:  2   3rd Qu.: 10.000   3rd Qu.: 11.00  
##  Max.   :3263   opolski    :  2   Max.   :158.000   Max.   :183.00  
##                 (Other)    :368   NA's   :1
fivenum(d$hotele2017)
## [1]   0   4   7  11 183

Computing mean:

mean(d$hotele2017)
## [1] 10.31053

And dispersion:

var(d$hotele2012); var(d$hotele2017)
## [1] NA
## [1] 244.8743
sd(d$hotele2012); sd(d$hotele2017)
## [1] NA
## [1] 15.64846

Second attempt, this time ignoring missing values (no output, as the respective values are saved in the variables var12, var17, sd12 and sd17):

var12 <- var(d$hotele2012, na.rm = TRUE); var17 <- var(d$hotele2017, na.rm = TRUE)
sd12 <- sd(d$hotele2012, na.rm = TRUE); sd17 <- sd(d$hotele2017, na.rm = TRUE)

BTW:

c( mean(d$hotele2012, na.rm=T), mean(d$hotele2017, na.rm=T))
## [1]  8.775726 10.310526

Or more formally: there were about 8.78 hotels on average in every county in Poland in 2012, while in 2017 there were about 10.31.

The Interquartile Range (IQR) is the distance between the upper (75%) quartile and the lower (25%) quartile. The IQR covers the central 50% of observations. It is a robust measure of dispersion, ie it is hardly affected by extreme values:

c( IQR(d$hotele2012, na.rm=T), IQR(d$hotele2017, na.rm=T))
## [1] 7 7
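
What "robust" means can be checked on a small made-up vector: replacing the largest value with an extreme one (183, borrowed from the maximum reported above) leaves the IQR untouched while the standard deviation explodes. The data below are invented for illustration only.

```r
x     <- c(2, 3, 4, 5, 6, 7, 8, 9, 10)   # hypothetical well-behaved sample
x_out <- c(x[-length(x)], 183)           # same sample, largest value replaced by an outlier

c(IQR(x), IQR(x_out))   # the quartiles do not move, so the IQR is identical
c(sd(x),  sd(x_out))    # the standard deviation is dominated by the single outlier
```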

Finally we can equally easily assess the skewness:

library(moments)
c(skewness(d$hotele2012, na.rm=T), skewness(d$hotele2017))
## [1] 5.998884 5.899827

Distribution skewness is significant in both periods. Using the (modified) Pearson's formula \((\bar x - D)/\sigma\), where \(D\) is the mode, we obtain:

library("DescTools")
(mean(d$hotele2017) - Mode(d$hotele2017) )/ sd17  
## [1] 0.4032682

Still the distribution is positively skewed, but the value of the coefficient is much smaller.
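
If one prefers not to depend on an extra package, the same coefficient (mean minus mode, divided by the standard deviation) can be sketched in base R. The helper name pearson_mode_skewness is made up for this example, and in case of ties it simply takes the smallest mode, which is a simplification.

```r
# Pearson's mode skewness: (mean - mode) / sd, computed with base R only.
pearson_mode_skewness <- function(x) {
  x <- x[!is.na(x)]                       # drop missing values
  counts <- table(x)                      # frequency of every distinct value
  mode_value <- as.numeric(names(counts)[which.max(counts)])  # smallest mode if tied
  (mean(x) - mode_value) / sd(x)
}

x <- c(1, 2, 2, 2, 3, 4, 10)   # hypothetical right-skewed sample
pearson_mode_skewness(x)       # positive, as the long tail is on the right
```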

Better tools: for producing better charts

In spite of the fact that statistical charts are now ubiquitous in the media, this topic is usually covered only marginally in most statistics courses, probably because it is pretty hard to produce quality graphics with office software (complexity vs difficulty). This pitiful state of affairs can be changed with the introduction of modern tools. (How to make quality graphics is the main subject of my lecture BTW.)

Statistical charts can be plotted for the following three purposes:

Note: Some researchers recommend using charts at the data cleaning stage of statistical analysis. I do not agree. Data cleaning can be automated and should rely neither on manual work nor on visual inspection. Using programs to check data is a more efficient and reliable procedure. It is also 100% replicable, contrary to visual inspection.

A visual-art designer, not a statistician, is the right person for the 1st purpose. I am not an art designer so I will not tell you how to prepare eye-catching pictures. I am a statistician and I will concentrate on effective graphical methods for statistical explanation/exploration. And by effective I mean that one (graphical) method is more effective than another if its quantitative information can be decoded more quickly/easily [Robbins 2005]
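
A minimal sketch of what such automated checking might look like. The data frame is invented (it only borrows the column names teryt and hotele2017 from the hotel data set), and the three rules are examples, not a complete validation suite.

```r
# Hypothetical extract with deliberately planted errors.
hotels <- data.frame(
  teryt      = c(201, 202, 202, 1465),
  hotele2017 = c(4, -1, 7, NA)
)

# Each rule returns the ids of the offending rows; an empty vector means "passed".
problems <- list(
  duplicated_ids  = hotels$teryt[duplicated(hotels$teryt)],
  negative_counts = hotels$teryt[!is.na(hotels$hotele2017) & hotels$hotele2017 < 0],
  missing_counts  = hotels$teryt[is.na(hotels$hotele2017)]
)

problems
```

Because the rules are code, they can be re-run unchanged every time the data set is updated, which is not possible with eyeballing a chart.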

Types of charts

Some graphs are better than others:

Note: bar/line/pie charts were introduced by William Playfair in the 18th century. Dot plots were introduced by William Cleveland (1980s). Box-plots were introduced by John Tukey (1970s)

More Playfair’s charts can be found via google or at

Interlude: what, when and where

Before we continue with statistical graphics, a short 2-slide diversion on geocode standards used in statistics.

No doubt in every reliable survey the population has to be precisely defined ie 3 dimensions of every surveyed unit should be fixed: definition (what), time (when measured), space (where)…

I always repeat to my students: if you look at some data (in the media for example), start by establishing whether you know what, when and where. If no information (or reliable link, called a source, to the information) is provided on any of these dimensions, treat the data as rubbish and do not waste time using/analysing it.

Further dissemination of such defective data should be publicly prosecuted (joke)

I have already tried to show you that the what is complicated and often highly unreliable/arbitrary (due to the nature of the phenomenon and/or measurement difficulties).

The when dimension is much simpler thanks to a universal standard, ie. time. You gather data either for a certain moment (how many hotels were in use on 31st December 2018) or for a certain period of time (how many beds were sold in these hotels in the 3rd quarter of 2018).

The where dimension in turn is usually based on administrative or statistical (geographical) units (country, state/province, county, community). But contrary to the time dimension there is no universal or globally accepted standard for geostatistical units. Usually such a standard is based on the administrative system, which is country-dependent.

The administrative division of Poland since 1999 has been based on three levels of subdivision (cf Administrative divisions of Poland). Since Poland became a member of the European Union in 2004, EU regulations have been part of the national legal system.

EU regulates everything, statistics included.

Conclusion: The pigs had to expend enormous labours every day upon mysterious things called “files,” “reports,” “minutes,” and “memoranda.” These were large sheets of paper which had to be closely covered with writing, and as soon as they were so covered, they were burnt in the furnace (George Orwell, Animal Farm)

Interlude: NUTS and TERYT

The Nomenclature of Territorial Units for Statistics (NUTS) is a geocode standard for referencing the subdivisions of countries for statistical purposes. The standard is developed and regulated by the European Union, and thus only covers the member states of the EU in detail (cf NUTS)

The NUTS standard has been revised several times (on average every 4 years :-)), so there is even a page in the ec.europa.eu domain dedicated to the (short) history of NUTS (cf NUTS history)

NUTS1 (level) – macroregion, NUTS2 – state, NUTS3 – county. We would like to plot a chart showing the number of hotels.

Poland is divided into 16 states (NUTS2) and 380 counties (NUTS3), which are equal to administrative units. So on average there are 23.75 counties per state. The NUTS1 level is only for statistical purposes (but the regions are in fact distinct due to history, economics, natural conditions, cultural factors etc…)

There is a relevant and interesting page by GUS (Main Statistical Office or Główny Urząd Statystyczny), but unfortunately in Polish (use google translate :-) in case you are interested, or mail me) (cf Klasyfikacja NUTS w Polsce)

The above map shows 7 macroregions (NUTS1) and 16 provinces (NUTS2). BTW province in Polish is “prowincja” (both words come from Latin), but the Polish administrative province is actually called “województwo”, from “wodzić”, ie to command (armed troops in this context). This is an old term/custom from the 14th century, when Poland was divided into provinces (every province ruled by a “wojewoda”, ie the chief of that province). More can be found at Wikipedia (cf Administrative divisions of Poland)

NUTS3 consists of 380 counties (called “powiat”). In old Poland a powiat was called “starostwo” and the head of a “starostwo” was called “starosta”. “Stary” means old, so a “starosta” is an old (and thus wise) person. BTW the head of a powiat is still called “starosta”, as 600 years ago :-)

There is no NUTS4 level but there is a 3rd level of Polish administration used by GUS (Main Statistical Office). This 3rd level is called “gmina” (community).

There are (approximately) 2750 communities in Poland. As the population of Poland is 38.5 million and its area is 312.7 thousand sq km (about 120 persons per sq km), on average each powiat has 820 sq km and each community 113.5 sq km, or approximately 100 thousand persons per “powiat” and 14 thousand per “gmina”.

TERYT is the Polish counterpart of NUTS (developed some 50 years ago). It is a complex system which includes identification of administrative units. Every unit has an id number of up to 7 digits: wwppggt, where ww = “województwo” id, pp = “powiat” id, gg = “gmina” id and t encodes the type of community (rural, municipal or mixed). Higher-level units have trailing zeros for the irrelevant part of the id, so 14 and 1400000 mean the same, as do 1205 and 1205000. Six digits are enough to identify a community (approx 2750 units).
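
The wwppggt layout can be taken apart with a few lines of R. This is a sketch only: decode_teryt is a made-up helper, the 7-digit example code is hypothetical, and ids are assumed to be stored as character strings (stored as numbers they lose a leading zero, cf the teryt column above whose minimum is 201).

```r
# Split a TERYT id into its components, padding short ids with trailing zeros.
decode_teryt <- function(code) {
  stopifnot(is.character(code), nchar(code) <= 7)
  padded <- paste0(code, strrep("0", 7 - nchar(code)))  # "14" -> "1400000"
  list(
    wojewodztwo = substr(padded, 1, 2),
    powiat      = substr(padded, 3, 4),
    gmina       = substr(padded, 5, 6),
    type        = substr(padded, 7, 7)
  )
}

decode_teryt("1205")     # a powiat: gmina and type parts are zeros
decode_teryt("2213011")  # hypothetical full 7-digit community id
```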

So you are now experts on administrative division of Poland, and we can go back to statistical charts…

Strip charts

A strip chart (strip plot) shows the distribution of data points along a numerical axis. These plots are a suitable alternative to box plots when sample sizes are small (because they preserve more information about the data).

Example: Number of hotels in powiat by region (NUTS1, 2017):

The biggest potential problem with a dot/scatterplot is overplotting: whenever one has more than a few points, points may be plotted on top of one another. This can severely distort the visual appearance of the plot (left panel)

There is no one solution to this problem, but there are some techniques that can help: use smaller dots, use semi-transparent dots (right panel), use jitter.

Jitter, a small random noise added to data, is shown below (higher jitter on the right panel)
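
What jitter does to the numbers can be seen with base R's jitter() function. The hotel counts below are invented; the amount of 0.2 is an arbitrary choice for the illustration.

```r
set.seed(1)                                 # make the random noise repeatable
hotels   <- c(3, 3, 3, 4, 4, 7, 7, 7, 7)    # hypothetical counts with many ties
jittered <- jitter(hotels, amount = 0.2)    # adds uniform noise from [-0.2, 0.2]

anyDuplicated(hotels) > 0     # TRUE: tied points would be plotted on top of each other
anyDuplicated(jittered) > 0   # FALSE: every point now has its own position
```

The noise is purely cosmetic, so the jittered values should be used for plotting only, never for computation. In base graphics stripchart(hotels, method = "jitter") applies the same idea directly.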

Histograms and kernel density functions

Histograms show the distribution of a set of data. To draw a histogram the numbers (observations) are grouped into bins (intervals or classes). There is a tradeoff between showing details and showing the overall picture. When the bin width changes, the scale of the Y-axis changes as well (more bins, fewer observations in each bin). Example: number of hotels in Poland (2017):

library(ggplot2)
ggplot(d, aes(x = hotele2017)) +
  geom_histogram(bins = nclass.Sturges(d$hotele2017))

Histograms with binwidth equal to 20, 10, 5 and 1 respectively:

Kernel density functions

ggplot(data=d) + geom_density(aes(x=hotele2017))

# the adjust parameter scales the default bandwidth: the larger, the smoother the curve
p1 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=0.25)
p2 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=1.0)
p3 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=2.0)
p4 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=8.0)
library(ggpubr)  # provides ggarrange()
ggarrange(p1,p2,p3,p4)

Comparing distributions: box-plots

Box-plots are much better than histograms for comparing the distributions of more than one data set.

Construction of a (typical) box-plot: The middle bar is the median. The top/bottom edges of the rectangle are the 3rd and 1st quartiles, so the height of the rectangle is the IQR (interquartile range). The fanciful bars above/below the rectangle, called whiskers (google: whiskers mustache :-)), extend up to 1,5 times the IQR from the rectangle (or to the minimum/maximum if those values lie within 1,5 IQR). The symbols above/below the whiskers (usually open circles) are outliers (non-typical/extreme values)

Note the trick: outliers are defined not as (for example) the top/bottom 1% of values (every distribution would have outliers in such a case) but as values less than Q1 - 1,5 IQR or more than Q3 + 1,5 IQR (so distributions with moderate variability would not have outliers)
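
The usual (Tukey) version of this rule measures the 1.5 IQR distance from the quartiles, ie from the edges of the box. A sketch on invented data, cross-checked against base R's boxplot.stats():

```r
x <- c(1, 3, 4, 5, 5, 6, 7, 8, 30)   # hypothetical sample with one extreme value

q     <- unname(quantile(x, c(0.25, 0.75)))  # 1st and 3rd quartile
lower <- q[1] - 1.5 * IQR(x)                 # lower fence
upper <- q[2] + 1.5 * IQR(x)                 # upper fence

outliers <- x[x < lower | x > upper]
outliers               # values that a box-plot would draw as separate symbols
boxplot.stats(x)$out   # the same values as reported by base R
```

Note that boxplot.stats() uses hinges rather than quantile(), so for other samples the two can differ slightly; here they coincide.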

Example: age of Nobel-prize winners (cf The Nobel Prize API Developer Hub)

nlf <- read.csv("nobel_laureates3.csv", sep = ";", dec = ",", header = TRUE, na.strings = "NA")

ggplot(nlf, aes(x = category, y = age, fill = category)) + geom_boxplot() + ylab("years") + xlab("")
## Warning: Removed 39 rows containing non-finite values (stat_boxplot).

Multiple histograms are too detailed (binwidth=5). It is impossible, for example, to establish which category has the youngest laureates (on average) or which has the oldest (economics and literature are candidates, but due to the multimodality of the literature laureates' distribution it is difficult to say for sure…)

ggplot(nlf, aes(x=age, fill=category)) + geom_histogram(binwidth=5) +
    facet_grid(category ~ .)
## Warning: Removed 39 rows containing non-finite values (stat_bin).

Comparing distributions box-plots vs multiple histograms

Number of hotels in powiat by województwo (2017):

More jitter:

Boxplots are better:

Scatter-plots

A scatter-plot (aka scatter diagram, xyplot) is the basic chart for two (quantitative) variables.

To see the relationship between the variables, a line can be fitted. The least squares (LS) line, which assumes a linear relationship between the variables, is fitted by minimizing the sum of squares of the residuals (a residual is the difference between a data point and the corresponding point on the line, ie the point computed from the formula y = a + bx, where x is the value of the x-axis variable).

(Almost) every county in Poland is attractive for tourists, but those counties which are at the seaside (north) or in the mountains (south) are special. There are 11 counties at the seaside (morze = sea) and 18 in the mountains (góry):
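
For a single explanatory variable the least squares coefficients have a closed form, which can be compared with what lm() returns. The five data points are invented for the illustration.

```r
x <- c(1, 2, 3, 4, 5)               # hypothetical x-axis values
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)    # hypothetical y-axis values

# Closed-form least squares estimates for the line y = a + b x
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

fit <- lm(y ~ x)          # the same line fitted by R
c(a = a, b = b)
coef(fit)                 # agrees with a and b up to rounding error

sum(residuals(fit)^2)     # the minimized sum of squared residuals
```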

## 
## Call:
## lm(formula = tz2017 ~ y2017, data = m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -127494  -22228    4466   20779   84551 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -52055      45219  -1.151   0.2793  
## y2017           5839       2324   2.513   0.0332 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60310 on 9 degrees of freedom
## Multiple R-squared:  0.4123, Adjusted R-squared:  0.347 
## F-statistic: 6.314 on 1 and 9 DF,  p-value: 0.03316
## 
## Call:
## lm(formula = tz2017 ~ y2017, data = m)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -26224  -5432   2165   5616  25199 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11071.5     5820.8  -1.902 0.086330 .  
## y2017          961.7      161.4   5.960 0.000139 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13270 on 10 degrees of freedom
## Multiple R-squared:  0.7803, Adjusted R-squared:  0.7584 
## F-statistic: 35.52 on 1 and 10 DF,  p-value: 0.0001394

So each new hotel in the mountains would on average attract about 962 foreign tourists, while a new hotel at the seaside would attract about 5839 foreign tourists (and both coefficients are statistically significant at \(\alpha=0.05\) :-) )

Alternatively a loess curve can be used, which does not assume linearity, but its parameters are not interpretable.

Scales

A logarithmic scale makes it possible to plot values whose range is too wide for a linear scale. Base 10 logarithms ‘squeeze’ the numbers more than base 2 logarithms (log10(100)=2 while log2(100)=6.64). Moreover, if the original scale contains multiples of 10, use log10 to get a ‘nice’ log scale, while if it contains multiples of 2, use log2.

A logarithmic axis turns an additive scale into a ‘multiplicative’ one (equal distances mean equal ratios). Example (Nobel prize again):

dA <- read.csv("nobel_laureates3.csv", sep = ";", dec = ",", header = TRUE, na.strings = "NA")
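
A quick numerical check of both claims, using made-up values:

```r
powers_of_10 <- c(1, 10, 100, 1000)
powers_of_2  <- c(1, 2, 4, 8, 16)

log10(powers_of_10)   # equally spaced: each tenfold increase is one step
log2(powers_of_2)     # equally spaced: each doubling is one step

c(log10(100), log2(100))   # base 10 squeezes the same number much more
```

In ggplot2 the corresponding axis can be requested with scale_y_log10() or scale_y_continuous(trans = "log2").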
nrow(dA)
## [1] 934
dS <-  subset(dA, (! bornCountryCode == "" )) # by country of birth
nrow(dS) # how many
## [1] 901

aggregate by bornCountryCode

Finally plot the resulting data using various Y-axis scales (arithmetic, log2 and log10)

Graphic perception tasks

From the best to the worst:

Angle judgement is not precise. Acute angles are underestimated while obtuse angles (greater than 90) are overestimated.

Area judgement is biased as well. It is hard to distinguish small differences in area, while it is quite easy when the same data are plotted along a common scale

The most accurate graphic task is positioning along a common scale

General design rules

Always include 0 in numerical axes

Never use more graphic features than your data set has dimensions. For univariate analysis use length or color, not both, for example. (Well, there are rare exceptions to this rule, see the example below)

Pseudo 3D charts for 2D data should be forbidden as well and without any exception. Virtually no-one can read them.

The lie factor and data-ink ratio [Tufte]

!!! ADD !!!

Lie factor example

This giant guy (GG) in the middle is our ex-president. The guy next to him on the left is our current president, Duda. Next to Duda is the ex-rock star Kukiz, the dark horse of the elections. This is the (slightly modified) cover of an influential Polish weekly magazine from January 2015, about five months before the elections (which took place in May 2015).

The figures are claimed to be in sync with recent survey results (sort of a barchart). Could you figure out from that chart the proportions of the candidates' scores? By how much does the giant guy outperform the runner-up? Which candidate is supported by this influential magazine (easy :-))?

The lie-factor details:

The line from the shoes to the top of the head equals (at a certain size of course) 204mm for GG, 134mm for Duda and 42.5mm for the ex-rock star. So \(204/134 \approx 1.5\) and \(204/42.5 = 4.8\). As \(44/29 \approx 1.5\) and \(44/9 \approx 4.9\) as well, formally the lie factor is (almost) perfect. But should one compare lengths or areas?

If one compares areas, not heights, one gets significantly different (and correct) results, namely: \((204 \times 58)/(134 \times 21) \approx 4.20\) and \((204 \times 58)/(42.5 \times 15) \approx 18.56\). The lie factor is \(4.2/1.5 = 280\)% and \(18.56/4.8 \approx 387\)% respectively. A huge distortion.
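
The arithmetic above can be reproduced in a few lines. The heights come from the text; the widths (58, 21 and 15 mm) are the ones appearing in the area formulas. Without intermediate rounding the first lie factor comes out as 276% rather than 280%.

```r
heights <- c(GG = 204, Duda = 134, Kukiz = 42.5)  # mm, shoes to top of the head
widths  <- c(GG = 58,  Duda = 21,  Kukiz = 15)    # mm, as used in the area formulas
areas   <- heights * widths

others <- c("Duda", "Kukiz")
height_ratio <- unname(heights["GG"]) / heights[others]  # what the picture claims to show
area_ratio   <- unname(areas["GG"])   / areas[others]    # what the eye actually compares

lie_factor <- area_ratio / height_ratio
round(100 * lie_factor)   # in percent
```

Since area is height times width, the lie factor here is simply the ratio of the widths: the distortion comes entirely from scaling the figures in both directions.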

Moreover two more tricks were applied to boost GG. Can you see them?

BTW: the text in the pink frame claims: “figure ratios are consistent with the April-May survey outcome.” (But what exactly does “figure ratios” mean?)

Banking to 45

The ratio between the width and the height of a rectangle is called its aspect ratio.

The aspect ratio describes the area that is occupied by the data in the chart. A change in aspect ratio changes the perception of the graph. The question is which aspect ratio is the best.

We can recognize change most easily if the absolute slopes are close to a 45 degree angle on the graph. It is much harder to see change if the curves are nearly horizontal/vertical. The idea (Cleveland, 1988) behind banking is therefore to adjust the aspect ratio of the entire plot in such a way that most slopes are at an approximate 45 degree angle.

Setting the aspect ratio so that the average of the values of the orientations is 45 degrees is called “banking the average orientation to 45 degrees”.

Setting the aspect ratio so that the weighted mean of the orientations of the line segments (weighted by the segments’ lengths) is approximately 45 degrees is called the average weighted orientation method (to 45 degrees).
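
Both average-orientation methods require numerical optimisation, so the sketch below uses a simpler relative, the median absolute slope rule: choose the height/width ratio at which the median line segment is drawn at exactly 45 degrees. The function name bank_median_slope and the sine-wave data are made up for this illustration.

```r
# Height/width ratio of the plotting area that banks the median slope to 45 degrees.
bank_median_slope <- function(x, y) {
  # Slopes of consecutive segments after rescaling both axes to unit length.
  dx <- diff(x) / diff(range(x))
  dy <- diff(y) / diff(range(y))
  slopes <- abs(dy / dx)
  slopes <- slopes[is.finite(slopes) & slopes > 0]   # ignore flat/vertical segments
  # On a plot with ratio r the drawn slope is r * slope; solve r * median = 1.
  1 / median(slopes)
}

x <- seq(0, 10, by = 0.5)
ratio <- bank_median_slope(x, sin(x))
ratio   # below 1: an oscillating series calls for a plot wider than it is tall
```

In practice one rarely computes this by hand; the lattice package, for example, banks to 45 degrees when a plot is drawn with aspect = "xy".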

Exercise: assess which slope is the steepest one and which is the smallest one?

ADD

Elementary spatial analysis (Heat maps/thematic maps)

Geocoding and reverse geocoding

Diversion: tools for geocoding and reverse geocoding

Diversion: tools for building heat/thematic maps

QGis

Example: Poland (population, incomes, distribution of)

Bivariate analysis

Example: tourist vs industry vs education (in Poland)

Timeseries analysis

Interlude: Crusaders, Knights and Malbork castle

First a short explanation of the subject of the analysis, ie the famous Castle of the Teutonic Order in Malbork, which is inscribed on the UNESCO heritage list (cf UNESCO heritage list):

Several religious military orders were formed in the Holy Land during the Crusades: Templars, Hospitallers, Teutonic Knights

The Teutonic Knights or the Teutonic Order of the Hospital of St. Mary in Jerusalem, were known in Poland as Krzyżacy on account of the black cross they wore on their white coats.

Established in 1190 to protect German pilgrims in the Holy Land, the order was later transformed in order to fight heretics.

In 1226 the Teutonic Knights came to Poland, invited by Duke Konrad I of Mazovia to fight the annoying pagan Prussian tribes, who invaded Poland from the north from time to time. The Teutonic Knights conquered Prussia, exterminated the locals and founded a powerful state with Malbork (Marienburg or Mary’s castle in German) as its capital.

BTW: Kwidzyn in German is called Marienwerder (Mary’s meadow) and there were a lot more places named Marien-something (as Marien is St Mary in German)

BTW2: It is about 40km from Kwidzyn to Malbork :-)

Interlude: examples of very bad graphs

There is a peer-reviewed research paper on tourist traffic in the castle museum of Malbork

The determinants of the tourist traffic in the castle’s museum of Malbork

Unfortunately all charts in this paper contain elementary errors. Could you identify them?

If one insists on using piecharts (improved version):

or better, using bar/dot charts:

Even worse graphics (yes we can:-) )

Piecharts are notorious for obscurity:

What about this barchart (distribution of seats in the Polish parliament (Sejm) after the 2015 elections; the Sejm has 460 seats, so 50% is 230 seats)?

Remember the dark-horse ex-rock star Kukiz? IMO his bar does not look like it is equal to 50 votes (minus 1). The PO bar is peculiar as well…

Not to mention the strange tilt to the left…

New workflow (finally): reproducible research

Sorry but why use all this strange stuff at all?

So you are probably still wondering why I am punishing myself by using such an odd system. I will present the most important argument in a moment; it concerns the basic approach (philosophy, if one has to be pompous) to doing statistical analysis.

This mode (or concept) is called Reproducible Research (RR for short).

Serious statistical analysis is not a one-off job. There is a value chain as well as a life cycle of statistical analysis. Value chain means that there are distinct stages, while life cycle means that the same data/models are used for years: most statistical analyses do not start from scratch but are based on data from the past augmented with new data. The problem is that the new data and model modifications should be in sync with the past.

To make the problem worse, serious statistics should also be in sync with the work of others (to ease, or to make possible at all, any meaningful (international) comparisons, for example).

Reproducible research or how to make statistical computations more meaningful

Abandoning the habit of secrecy in favor of process transparency and peer review was the crucial step by which alchemy became chemistry. (Eric S. Raymond, The Art of UNIX Programming, Addison-Wesley)

Replicability vs Reproducibility

Hot topic: a Google search for “reproducible research” returns about 158,000 results.

Replicability: an independent experiment targeting the same question will produce a result consistent with the original study.

Reproducibility: ability to repeat the experiment with exactly the same outcome as originally reported [description of method/code/data is needed to do so].

Computational science is facing a credibility crisis: it’s impossible to verify most of the computational results presented at conferences and in papers today. (Donoho D. et al. 2009)
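The reproducibility definition above can be illustrated with a minimal sketch in Python (standard library only). The function name `bootstrap_mean`, the seed value and the toy data are all hypothetical, made up for this illustration. The point is only that a simulation-based result becomes exactly repeatable once the code, the data and the random seed are all written down.

```python
import random
import statistics


def bootstrap_mean(data, n_resamples, seed):
    """Bootstrap estimate of the mean; the explicit seed makes the run repeatable."""
    rng = random.Random(seed)  # private generator, independent of global state
    means = []
    for _ in range(n_resamples):
        # resample with replacement, same size as the original data
        resample = [rng.choice(data) for _ in data]
        means.append(statistics.mean(resample))
    return statistics.mean(means)


# Hypothetical toy data, invented for illustration only
sample = [12.1, 9.8, 11.4, 10.7, 13.2, 8.9]

first = bootstrap_mean(sample, n_resamples=200, seed=2015)
second = bootstrap_mean(sample, n_resamples=200, seed=2015)
print(first == second)  # True: same code + same data + same seed
```

Without the recorded seed, a second run would most likely give a slightly different number: consistent with the first one, but not identical to it.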

Australopithecus (Current practices)

Use Excel for data cleaning & descriptive statistics. Excel handles missing data inconsistently and sometimes incorrectly. Many common functions are poor or missing in Excel.

Use SPSS/SAS/Stata in point-and-click mode to run serious statistical analyses.
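As a contrast to the missing-data point above, here is a minimal Python sketch of handling missing values explicitly in code. The helper `mean_skipping_missing` and the toy column are hypothetical; the rule "treat `None` and NaN as missing and report how many were dropped" is just one possible convention, assumed here for illustration.

```python
import math


def mean_skipping_missing(values):
    """Mean of the observed values; None and NaN count as missing.

    Returns (mean, n_missing), so the number of dropped observations
    is always reported instead of being silently ignored.
    """
    observed = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    n_missing = len(values) - len(observed)
    if not observed:
        raise ValueError("no observed values")
    return sum(observed) / len(observed), n_missing


# Hypothetical toy column with two missing entries
column = [4.0, None, 6.0, float("nan"), 5.0]
mean, n_missing = mean_skipping_missing(column)
print(mean, n_missing)  # 5.0 2
```

The rule for missing data is written down once, in one place, and applied the same way every time the script runs.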

Problems

Tedious/time-wasting/costly.

Even a small data/method change requires extensive recomputation and a careful revision and update of the report/paper.

Error-prone: difficult to record/remember a ‘click history’.

Famous example: the Reinhart and Rogoff controversy. Their claim: countries with a very high debt-to-GDP ratio suffer from low growth. However, the study suffers from serious but easily identifiable flaws, which were discovered when Reinhart and Rogoff published the dataset they used in their analysis (cf. Growth_in_a_Time_of_Debt).

Homo habilis (Enhanced current practices)

Benefits

Improved: reliability, transparency, automation, maintainability. Lower costs (in the long run).

Solves 1–2 but not 3–4.

Problems: Steeper learning curve. Perhaps higher costs in the short run. Duplication of effort (or a mess if scripts/programs are poorly documented).

Homo erectus (Literate statistical programming)

Literate programming concept: Code and description in one document. Create software as works of literature, by embedding source code inside descriptive text, rather than the reverse (as in most programming languages), in an order that is convenient for human readers.

A program is like a WEB, tangled and weaved (turned into a document), with relations and connections between the program parts. We express a program as a web of ideas. WEB is a combination of a document formatting language and a programming language.

The general idea of literate statistical programming mimics Knuth’s WEB system.

Statistical computing code is embedded inside descriptive text. A literate statistical program is weaved (turned) into a report/paper by executing the code and inserting the results obtained, so the report can be regenerated after data/method changes.

Solves 1–4.
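The weaving step described above can be sketched with a toy example in Python. This is not how any real tool works internally: the `{{ ... }}` chunk syntax, the `weave` function and the data are all hypothetical, chosen only to show the principle of executing embedded code and inserting its results into the text.

```python
import re

# Hypothetical chunk syntax for this toy: {{ python expression }}
CHUNK = re.compile(r"\{\{(.+?)\}\}")


def weave(template, env):
    """Replace each {{ expression }} chunk with its evaluated result."""
    def run_chunk(match):
        # eval is acceptable in a toy; never use it on untrusted input
        expression = match.group(1).strip()
        return str(eval(expression, {"__builtins__": {}}, env))
    return CHUNK.sub(run_chunk, template)


data = [3, 5, 8]  # hypothetical toy data
env = {"data": data, "len": len, "sum": sum}
report = "We analysed {{ len(data) }} observations; the total is {{ sum(data) }}."

print(weave(report, env))  # We analysed 3 observations; the total is 16.

data.append(4)             # the data changes ...
print(weave(report, env))  # We analysed 4 observations; the total is 20.
```

The descriptive text is never edited by hand after a data change; weaving it again brings every number in the report up to date.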

LSP: Benefits/Problems/Tools

Problems of LSP: Many incl. costs and learning curve

Tools:

Diversion: Github

New Tools (hipster part)

R/RStudio for computing and data visualization

Github for enhancing team work

Markdown for reproducible research

New practice [recap]

Summary: Resources

cheatsheets, QGIS tutorials, gis.stackexchange.com

Summary Data banks

https://git.generalassemb.ly/briancwq/classes/blob/master/week-01/lessons/python-descriptive_statistics_numpy-lesson-master/archive/LESSON.md

Questions

Thanks